Week 1: Exploratory Data Analysis and Feature Engineering¶

Student Name 1, Student Name 2

Aims¶

By the end of this notebook you will

  • understand the main aspects of data pre-processing
  • be familiar with tools for exploratory data analysis and visualization
  • understand the basics of feature engineering
  • build your first pipeline

Topics and Instructions¶

  1. Problem Definition and Setup

  2. Exploratory Data Analysis

  3. Data Preprocessing

  4. Feature Engineering

  5. Summary

In lecture this week, we reviewed the general machine learning pipeline, which, following the "Machine Learning Project Checklist" of Géron (2019), can be structured as:

  • Frame the problem and look at the big picture.
  • Get the data.
  • Explore the data and gain insights.
  • Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
  • Explore many different models and shortlist the best ones.
  • Fine-tune your models and combine them into a great solution.
  • Present your solution.
  • Launch, monitor, and maintain your system.

In this week's workshop, we will focus on the initial steps of this pipeline: data pre-processing, exploratory data analysis, and feature engineering.

During workshops, you will complete the worksheets together in teams of 2-3, using pair programming. During the first few weeks, the worksheets will contain cues to switch roles between driver and navigator. When completing worksheets:

  • You will have tasks tagged by (CORE) and (EXTRA).
  • Your primary aim is to complete the (CORE) components during the workshop session; afterwards, you can attempt the (EXTRA) tasks as self-study.
  • Look for the 🏁 as cue to switch roles between driver and navigator.
  • In some exercises, you will find helpful hints at the bottom of the questions.

Instructions for submitting your workshops can be found at the end of the worksheet. As a reminder, you must submit a PDF of your notebook on Learn by 16:00 on the Friday of the week the workshop was given.

Problem Definition and Setup ¶

Packages¶

Now let's load some packages to get us started. The following are widely used libraries for working with Python in general.

In [36]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

If you need to install any packages from scratch, you need to install the related library before calling it. For instance, feature-engine is a Python library for Feature Engineering and Selection, which:

  • contains multiple transformers to engineer and select features to use in machine learning models.

  • preserves scikit-learn functionality with methods fit() and transform() to learn parameters from and then transform the data (we will learn more about these throughout the course!).

In [37]:
# To install the feature-engine library (if not already installed)
!pip install feature-engine
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple/
Requirement already satisfied: feature-engine in f:\anaconda\envs\mlp\lib\site-packages (1.9.3)

In some cases, we may need only a component of a library. If this is the case, it is possible to import specific items from a module (library), using the following line of code:

In [38]:
from feature_engine.imputation import EndTailImputer

Problem¶

Now, it is time to move on to the next step.

You are asked to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.

Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

The first question to ask your boss is what exactly is the business objective; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem. Your boss answers that the district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.

This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 20%. This is why the company thinks that it would be useful to train a model to predict a district’s median housing price given other data about that district. The census data looks like a great dataset to exploit for this purpose, since it includes the median housing prices of thousands of districts, as well as other data.


🚩 Exercise 1 (CORE)¶

Using the information above answer the following questions about how you may design your machine learning system.

a) Is this a supervised or unsupervised learning task?

Type your answer here!

b) Is this a classification, regression, or some other task?

Type your answer here!

c) Suppose you are only required to predict if a district's median housing prices are "cheap," "medium," or "expensive". Will this be the same or a different task?

Type your answer here!

Data Download¶

The data we will be using this week is a modified version of the California Housing dataset. We can get the data in a number of ways; the easiest is to load it from the working directory (where it has already been downloaded).

In [39]:
housing = pd.read_csv("housing.csv")
housing.head()
Out[39]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

Exploratory Data Analysis ¶

In this section we are going to start with exploring the California Housing data using methods that you will likely already be familiar with.

Data can come in a broad range of forms encompassing a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Processing data using exploratory data analysis (EDA) can elicit useful information and knowledge by examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions.

Let's start by examining the Data Dictionary and the variables available:

longitude: A measure of how far west a house is; a higher value is farther west

latitude: A measure of how far north a house is; a higher value is farther north

housingMedianAge: Median age of a house within a block; a lower number is a newer building

totalRooms: Total number of rooms within a block

totalBedrooms: Total number of bedrooms within a block

population: Total number of people residing within a block

households: Total number of households, a group of people residing within a home unit, for a block

medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

medianHouseValue: Median house value for households within a block (measured in US Dollars)

oceanProximity: Location of the house w.r.t ocean/sea

🚩 Exercise 2 (CORE)¶

a) Examine the datatypes for each column calling info(). What is the total number of observations and total number of variables? What is the type of each variable?

In [40]:
# Code for your answer here!
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Type your answer here!

b) From the information provided above, can you anticipate any data cleaning we may need to do?

total_bedrooms has missing values (20433 non-null out of 20640), and the categorical ocean_proximity will need to be encoded (e.g. via one-hot encoding).

🚩 Exercise 3 (CORE)¶

a) Use descriptive statistics and histograms to examine the distributions of the numerical attributes.

Hint
  • .describe() can be used to create summary descriptive statistics on a pandas dataframe.
  • You can use a sns.histplot to create histograms
In [41]:
# Code for your answer here!
display(housing.describe())

housing.hist(bins = 50, figsize = (10,10))
plt.tight_layout()
plt.show()
[Output: histograms of the numerical attributes]

b) Can you identify other pre-processing/feature engineering steps we may need to do? Which variables represent counts and how are they distributed?

1. Missing values in total_bedrooms need imputing; ocean_proximity needs encoding (e.g. one-hot); the heavily skewed count variables could be transformed (e.g. with log1p). 2. All the numerical count variables (total_rooms, total_bedrooms, population, households) are right-skewed.

c) One thing you may have noticed from the histogram is that the median income, housing median age, and the median house value are capped. The median house value capping (this being our target value) may or may not be a problem depending on your client. If we needed precise predictions beyond $\$500,000$, we may need to either collect proper labels/outputs for the districts whose labels were capped or remove these districts from the data. Following the latter, remove all districts whose median house value is capped. How many observations are there now?

In [42]:
# Code for your answer here!
# Remove the cases where median_house_value >= 500,000$
cap = housing["median_house_value"].max()
housing_no_cap = housing[housing["median_house_value"]<cap]
housing_no_cap.shape[0]
Out[42]:
19675

🚩 Exercise 4 (CORE)¶

What are the possible categories for the ocean_proximity variable? Are the number of instances in each category similar?

Hint
  • value_counts() can be used to count the values of the categories a pandas series.
  • You can use a sns.countplot to create barplot with the number of instances of each category
In [43]:
housing["ocean_proximity"].unique()          # all categories
housing["ocean_proximity"].value_counts()    # count per category
housing["ocean_proximity"].value_counts(normalize=True)  # proportion per category
Out[43]:
ocean_proximity
<1H OCEAN     0.442636
INLAND        0.317393
NEAR OCEAN    0.128779
NEAR BAY      0.110950
ISLAND        0.000242
Name: proportion, dtype: float64

Type your answer here!

🏁 Now is a good point to switch driver and navigator

🚩 Exercise 5 (CORE)¶

Examine if/which of the features are correlated to each other. Are any of the features correlated with our output (median_house_value) variable?

  • Can you think of any reason why certain features may be correlated?

  • How might we use this information in later steps of our model pipeline?



Hint
  • .corr() can be used to compute the correlations.
  • You can use a sns.heatmap to visualize the correlations
In [44]:
# Code for your answer here!
# correlation matrix (numeric only)
corr = housing.corr(numeric_only=True)
# heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
[Output: correlation heatmap]

median_income has a strong positive correlation with median_house_value: people who earn more tend to buy more expensive houses. In later steps of the pipeline, we can use this information to reduce multicollinearity among correlated features.

🚩 Exercise 6 (CORE)¶

Use sns.pairplot to further investigate the joint relationship between each pair of variables. What insights into the data might this provide over looking only at the correlation?

In [45]:
# Code for your answer here!
sns.pairplot(housing, diag_kind="hist", plot_kws={"alpha": 0.3, "s": 15})
plt.tight_layout()
plt.show()
[Output: pairplot of the variables]

Type your answer here!

Data Pre-Processing ¶

Now that we have some familiarity with the data through EDA, let's start preparing our data to be modelled.

Data Cleaning¶

Let's start with some basic data cleaning steps. For example, we may want to:

  • deal with duplicates, inconsistencies, or typos in the data,
  • handle missing data,
  • remove uninformative features (e.g. subject identifiers),
  • fix variable types,
  • adjust data codes (e.g. missing values may be coded as ‘999’ instead of NA),
  • optionally remove outliers.

Let's start with the first of these.

Data Duplication and Errors¶

We want to remove duplicates that may have accidentally been entered into the database twice, as they may bias our fitted model. In other words, we may potentially overfit to this subset of points. However, care should usually be taken to check that they are not real data points that happen to have identical values.

There are a number of ways we could identify duplicates; the simplest (and the approach we'll focus on) is to find observations with identical feature values. Of course, this will not identify issues such as spelling errors, missing values, address changes, or use of aliases. These commonly occur with categorical or text data, so checking the unique values is recommended. In general, such errors may require more sophisticated methods along with manual assessment.

🚩 Exercise 7 (CORE)¶

a) Are there any duplicated values in the data? If so how many?

Hint With Pandas dataframes you can use `.duplicated()` to get a boolean of whether something is a duplicate and then use `.sum()` to count how many there are.

b) What are the unique values of the categorical variable? Are there any duplicated categories arising from misspellings?

In [46]:
# Code for your answer here!
housing.duplicated().sum()
housing["ocean_proximity"].unique()
housing["ocean_proximity"].value_counts(dropna=False)
Out[46]:
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

Type your answer here!

Outlier Detection¶

An outlier is a data point that lies abnormally far from the other observations and may distort the model fit and results. In general, it is a good idea to examine whether any outliers are present during preprocessing. In some cases, you may want to drop these observations or cap their values (see https://feature-engine.trainindata.com/en/1.8.x/api_doc/outliers/index.html). However, this may not be appropriate without explicit knowledge of, and testing for, whether they really are outliers. In particular, by dropping or capping those observations you can unwittingly discard important information!

We will use basic statistics in order to try to identify outliers. A simple method of detecting outliers is to use the inter-quartile range (IQR) proximity rule (Tukey fences) which states that a value is an outlier if it falls outside these boundaries:

  • Upper boundary = 75th quantile + (IQR * $k$)

  • Lower boundary = 25th quantile - (IQR * $k$)

where IQR = 75th quantile - 25th quantile (the length of the box in the boxplot). This is used to construct the whiskers in the boxplot, where $k$ is a nonnegative constant which is typically set to 1.5 (the default value in sns.boxplot). However, it is also common practice to find extreme values by setting $k$ to 3.
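The rule above can be sketched directly in pandas on a toy series (the helper name `iqr_outliers` is ours, not part of the worksheet):

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Boolean mask flagging values outside the Tukey fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# toy data: 100 lies well outside the fences (IQR = 2, upper fence = 7)
mask = iqr_outliers(pd.Series([1, 2, 3, 4, 100]))
print(mask.sum())  # 1
```

Increasing `k` to 3 widens the fences, so only more extreme values are flagged.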

🚩 Exercise 8 (EXTRA)¶

a) Can you identify any potential outliers using the generated boxplots below? Do you think any points should be removed?

b) Try changing $k$, defining the length of the whiskers, to 3 in sns.boxplot. Can you still identify any potential outliers?

In [47]:
fig, axes = plt.subplots(figsize = (15,10), ncols = (housing.shape[1]-1)//3, nrows = 3, sharex = True)
axes = axes.flatten()
#one panel per numeric column: housing.shape[1] - 1 features (ocean_proximity excluded), arranged over 3 rows
for i, ax in enumerate(axes):
    sns.boxplot(y = housing.iloc[:,i], ax = ax, whis=3) 
    ax.set_title(housing.iloc[:,i].name)
    ax.set_ylabel("")
    
plt.suptitle("Boxplots")
plt.tight_layout()
plt.show()
[Output: boxplots of the numerical attributes]

Type your answer here!

🏁 Now is a good point to switch driver and navigator

Missing Data¶

Most ML models cannot handle missing values, and as we saw earlier, there are some present in total_bedrooms. We also saw that values of median_house_value are capped at $\$500,000$. This is another form of missingness, and an informative one (i.e. we know the missing values are greater than $\$500,000$). However, we will focus on methods for dealing with missingness in our features, not in the target variable.

As such, let's start by splitting our features from our target variable in the data set.

In [48]:
# Extracting the features from the data
X = housing.drop("median_house_value", axis = 1)
features = list(X.columns)
print(features)
print(X.shape)
display(X.head())
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'ocean_proximity']
(20640, 9)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 NEAR BAY
In [49]:
# Extracting the target features from the data
y = housing["median_house_value"].copy()
print(y.shape)
display(y.head())
(20640,)
0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
Name: median_house_value, dtype: float64

There are a number of ways we can deal with missing values. The simplest is to just remove NA values. We can do this in two ways by either:

  1. Getting rid of the corresponding observations (deleting the corresponding rows).
  2. Getting rid of the whole attribute (deleting the corresponding columns).

This is relatively straightforward: run housing.dropna() with axis set to 0 or 1 (depending on whether we want to remove rows or columns) before splitting our data into features (X) and outputs (y).
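The two options can be seen on a small toy frame (not the housing data):

```python
import numpy as np
import pandas as pd

# toy frame with one missing value in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna(axis=0)  # option 1: drop the offending rows
cols_dropped = df.dropna(axis=1)  # option 2: drop the offending columns
print(rows_dropped.shape, cols_dropped.shape)  # (2, 2) (3, 1)
```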

🚩 Exercise 9 (CORE)¶

Use dropna() to remove the missing observations. What is the shape of the feature matrix after dropping the missing observations?

Notes

  • It may be tempting to overwrite X while working on our pre-processing steps. Don't do this! We will run these objects through our pipeline, which combines the missing-data steps with other steps later, so if you want to test your function, make sure to assign the output to temporary objects (e.g. X_).
In [50]:
# Code for your answer here!
# Do not overwrite X
X_ = X.dropna()
X_.shape
print(X_.shape)
(20433, 9)

Instead of simply dropping missing data, we may instead want to use other imputation methods. From here on in, we will be creating functions for our data transformations. Later, we will see why this is really useful to define our model pipeline, which allows us to chain together transformations and steps in a reproducible way.

In this course we are mostly going to be using Scikit-learn, with a little Keras at the end for neural networks. Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning (https://scikit-learn.org/stable/getting_started.html). It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

We will first focus on the transformer class within Scikit-learn, which provides functions for missing data imputation along with many others useful for data pre-processing and feature engineering.

Transformers¶

If we want to alter the features of our data, we need a transformer.

  • Transformers are classes that follow the scikit-learn API to clean, impute, reduce, expand, or generate feature representations.

  • Transformers are classes with a .fit() method, which learn model parameters (e.g. mean for mean imputation) from a training set, and a .transform() method which applies this transformation model to data. To create a custom transformer, all you need is to create a class that implements three methods: fit(), transform(), and fit_transform().

Therefore to transform a dataset, each sampler implements:

obj.fit(data)
data_transformed = obj.transform(data)

or simply...

data_transformed = obj.fit_transform(data)

See more details: https://scikit-learn.org/stable/data_transforms.html. In the following subsections, we will see examples of transformers for categorical and numerical variables.
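The fit/transform recipe above can be sketched as a tiny custom transformer (MeanCenterer is a hypothetical example for illustration, not part of the worksheet; subclassing TransformerMixin supplies fit_transform() for free):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""
    def fit(self, X, y=None):
        # learn the model parameters from the training data
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self
    def transform(self, X):
        # apply the learned transformation to (possibly new) data
        return np.asarray(X, dtype=float) - self.means_

# fit_transform() comes for free from TransformerMixin
Xc = MeanCenterer().fit_transform([[1.0, 10.0], [3.0, 30.0]])
print(Xc)
```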

Data Imputation¶

Instead of removing the missing data we can set it to some value. To do this, Scikit-Learn provides various transformers, including:

  • SimpleImputer which provides simple strategies (e.g. "mean", "median" for numerical features and "most_frequent" for categorical features).
  • You can also add a missing indicator with the option add_indicator=True in SimpleImputer, or use the transformer MissingIndicator. This may be useful when the missingness itself provides information for predicting the target (e.g. obese patients may prefer not to report BMI, so this missingness could be useful for estimating the risk of health conditions or diseases).
  • Beyond simple imputation strategies, sklearn also provides more advanced ones in IterativeImputer and KNNImputer.
  • Other strategies are also available in feature_engine.imputation, such as EndTailImputer, which is useful when missing values are located in the tails (e.g. capped values for privacy).
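For intuition, the Gaussian end-of-tail rule can be reproduced by hand on a toy series (a sketch of the idea, assuming feature-engine's mean + 3·std rule; not a substitute for using EndTailImputer itself):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, np.nan])
# Gaussian end-of-tail value: mean + 3 * std, the rule EndTailImputer
# applies with imputation_method="gaussian", tail="right", fold=3
tail_value = s.mean() + 3 * s.std()
print(s.fillna(tail_value))  # the NaN becomes 5.0
```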

Let's start with the SimpleImputer to learn about transformers and how to deal with missing data in sklearn.

In [51]:
from sklearn.impute import SimpleImputer

# First create the imputer object/transformer
num_imputer = SimpleImputer(strategy="median")

# Now fit the object to the data
num_imputer.fit(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[51], line 7
      4 num_imputer = SimpleImputer(strategy="median")
      6 # Now fit the object to the data
----> 7 num_imputer.fit(X)

File f:\anaconda\envs\mlp\Lib\site-packages\sklearn\base.py:1363, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1356     estimator._validate_params()
   1358 with config_context(
   1359     skip_parameter_validation=(
   1360         prefer_skip_nested_validation or global_skip_validation
   1361     )
   1362 ):
-> 1363     return fit_method(estimator, *args, **kwargs)

File f:\anaconda\envs\mlp\Lib\site-packages\sklearn\impute\_base.py:436, in SimpleImputer.fit(self, X, y)
    418 @_fit_context(prefer_skip_nested_validation=True)
    419 def fit(self, X, y=None):
    420     """Fit the imputer on `X`.
    421 
    422     Parameters
   (...)    434         Fitted estimator.
    435     """
--> 436     X = self._validate_input(X, in_fit=True)
    438     # default fill_value is 0 for numerical input and "missing_value"
    439     # otherwise
    440     if self.fill_value is None:

File f:\anaconda\envs\mlp\Lib\site-packages\sklearn\impute\_base.py:361, in SimpleImputer._validate_input(self, X, in_fit)
    355 if "could not convert" in str(ve):
    356     new_ve = ValueError(
    357         "Cannot use {} strategy with non-numeric data:\n{}".format(
    358             self.strategy, ve
    359         )
    360     )
--> 361     raise new_ve from None
    362 else:
    363     raise ve

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'NEAR BAY'

Unfortunately, when we applied this to our data, we get the following error:

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float:

This is because the "median" strategy can only be used with numerical attributes so we need a way of only applying imputation to certain attributes. We could temporarily remove the categorical feature from our data to apply our function, or apply the function to a subset of the data and assign the output to the same subset.

However scikit-learn has a handy function to specify what column we want to apply a function to!

In [52]:
from sklearn.compose import ColumnTransformer

# Names of numerical columns
numcols = features[:-1]
print(numcols)
catcols = [features[-1]]
print(catcols)

num_cols_imputer = ColumnTransformer(
    # apply the `num_imputer` to all columns apart from the last
    [("num", num_imputer, numcols)],
    # a ("cat", cat_imputer, catcols) entry could also be added here
    # don't touch all other columns, instead concatenate it on the end of the
    # changed data.
    remainder = "passthrough"
) 

num_cols_imputer.fit(X)
# fit() learns the parameters from the data; fit_transform(X) would fit and transform in one step

# Print the median values computed by calling fit
print("Computed median values for each numerical feature:")
print(num_cols_imputer["num"].statistics_)
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
['ocean_proximity']
Computed median values for each numerical feature:
[-118.49     34.26     29.     2127.      435.     1166.      409.
    3.5348]

After calling .fit, our object has a number of new attributes, including statistics_, which stores the median value of each numerical attribute computed on the training set. These values are reused when validating and testing the model: they will be used to impute any missing values in new data.

Note

  • The fitted ColumnTransformer contains a list of transformers, stored in the attribute transformers_. We named the first transformer in the list num. Try running num_cols_imputer.transformers_ to see the names and types of the transformers in the list.
  • To access the fitted num_imputer in this case, num_cols_imputer["num"] is a shortcut to access the named transformer in the list.

Now, let's call transform on our fitted object to impute the missing values.

In [53]:
X_ = num_cols_imputer.transform(X)
print("Number of Missing Values")
pd.DataFrame(X_, columns = features).isna().sum()
Number of Missing Values
Out[53]:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
dtype: int64

🚩 Exercise 10 (CORE)¶

In addition to median imputation, alter your transformer to also include a missing indicator. What is the shape of the transformed feature matrix? Use the method .get_feature_names_out() to print the names of the new features.

Note: You may want to add the option verbose_feature_names_out = False in your ColumnTransformer to reduce the length of the feature names.

In [54]:
# Code for your answer here!


num_imputer = SimpleImputer(strategy="median", add_indicator=True)

ct = ColumnTransformer(
    [("num", num_imputer, numcols)],
    remainder="drop",
    verbose_feature_names_out=False,
)

X_t = ct.fit_transform(X)
print(X_t.shape)
print(ct.get_feature_names_out())
(20640, 9)
['longitude' 'latitude' 'housing_median_age' 'total_rooms'
 'total_bedrooms' 'population' 'households' 'median_income'
 'missingindicator_total_bedrooms']

🏁 Now is a good point to switch driver and navigator

Feature Engineering ¶

As discussed in the lectures, feature engineering is where we extract features from data and transform them into formats that are suitable for machine learning models. Today, we will have a look at two main cases that are present in our data: categorical and numerical values.

Feature engineering also requires a transformer class to alter the features.

Categorical Variables¶

  • In the dataset, we have a text attribute (ocean_proximity) that we already had to treat differently when cleaning and visualizing the data. This extends to feature engineering as well, where we need to use different methods from those used with numerical variables.

  • If we look at the unique values of this attribute, we will see that there are a limited number of possible values which represent a category. We need a way of encoding this information into our modeling framework by converting our string/categorical variable into a numeric representation that can be included in our models.

If we have a binary categorical variable (two levels), we could simply pick one of the levels and encode it as 1 and the other as 0.
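In pandas this is a one-line mapping (the series `coastal` is a hypothetical binary feature, not in our data):

```python
import pandas as pd

# hypothetical binary feature with two levels
coastal = pd.Series(["yes", "no", "yes"])
print(coastal.map({"yes": 1, "no": 0}).tolist())  # [1, 0, 1]
```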

However, as we have multiple categories in this case, we would probably want to use another encoding method. To illustrate, we can try encoding the categorical feature ocean_proximity using both the OrdinalEncoder and OneHotEncoder available in sklearn.preprocessing.

Side Notes

  • The output of the OneHotEncoder provided in Scikit-Learn is a SciPy sparse matrix, instead of a NumPy array. Sparse matrices are useful when you have lots of categories, as your matrix becomes mostly full of 0's. Storing all these 0's takes up unnecessary memory, so a sparse matrix stores only the locations of the nonzero elements. The good news is that you can use a sparse matrix much like a NumPy array, but if you want to, you can convert it to a dense array using .toarray().

  • This is not necessarily the case if the encoder is applied through a ColumnTransformer: the stacked output is only kept sparse if its overall density falls below the transformer's sparse_threshold parameter (0.3 by default); otherwise a dense array is returned.
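As a small sketch (using a made-up frame, not the housing data), we can check that the encoder's output is sparse and densify it on demand:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frame with one categorical column
df = pd.DataFrame({"prox": ["INLAND", "ISLAND", "INLAND", "NEAR BAY"]})

enc = OneHotEncoder()
out = enc.fit_transform(df)   # SciPy sparse matrix by default
print(sparse.issparse(out))   # True
print(out.toarray())          # dense NumPy array, one column per category
```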

Comment by Sone: a sparse matrix is more memory-efficient than a NumPy array when the data contains lots of zeros. With one-hot encoding, a categorical feature with 4 unique values becomes 4 one-hot columns.

Comment by Sone: ColumnTransformer lets you apply different preprocessing steps to different columns.

To impute missing values in numerical columns, use a SimpleImputer in a "num" transformer.

To encode categorical columns, use an encoder (e.g., OneHotEncoder or OrdinalEncoder) in a "cat" transformer.

In [55]:
from sklearn.preprocessing import OrdinalEncoder
# Defining the OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

encoder = ColumnTransformer(
    # apply the ordinal_encoder to the last column
    [("cat", ordinal_encoder, catcols)],
    remainder="passthrough",
    verbose_feature_names_out=False)

# fitting the encoder defined above
X_ = encoder.fit_transform(X) 

# Accessing the fitted ordinal encoder (encoder["cat"]) to see how the categories were mapped
display(dict(zip(list(encoder["cat"].categories_[0]), range(5))))

# Display the first few rows of the transformed data
display(pd.DataFrame(X_, columns = encoder.get_feature_names_out()).head())
{'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}
ocean_proximity longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
0 3.0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252
1 3.0 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014
2 3.0 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574
3 3.0 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431
4 3.0 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462
In [56]:
from sklearn.preprocessing import OneHotEncoder
# Defining the OneHotEncoder
onehot_encoder = OneHotEncoder()

encoder = ColumnTransformer(
    # apply the onehot_encoder to the last column
    [("cat", onehot_encoder, catcols)],
    remainder="passthrough",
    verbose_feature_names_out=False) 

X_ = encoder.fit_transform(X) 

# Display the first few rows of the transformed data
display(pd.DataFrame(X_, columns = encoder.get_feature_names_out()).head())
ocean_proximity_<1H OCEAN ocean_proximity_INLAND ocean_proximity_ISLAND ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
0 0.0 0.0 0.0 1.0 0.0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252
1 0.0 0.0 0.0 1.0 0.0 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014
2 0.0 0.0 0.0 1.0 0.0 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574
3 0.0 0.0 0.0 1.0 0.0 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431
4 0.0 0.0 0.0 1.0 0.0 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462

Comment by Sone:

OrdinalEncoder disadvantages: it imposes an artificial order on the categories, creating spurious distances between them (a problem especially for linear models and k-NN).

OneHotEncoder disadvantages: it expands the data into many columns and can overfit when there are many rare categories (a target encoder can help in that case).

🚩 Exercise 11 (CORE)¶

a) What is the main difference between the two methods regarding the obtained features? Which encoding method do you think is most appropriate for this variable and why?

b) How sensible is the default ordering of the ordinal encoder? Use the parameter categories of OrdinalEncoder to apply a different ordering.

(a) Main differences:

One-Hot: Expands into many columns, does not impose any order

Ordinal: maps categories to a single integer column, imposing an order and distances


In [60]:
# (b)Code for your answer here!


order = [['INLAND', '<1H OCEAN', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND']]
ord_enc = OrdinalEncoder(categories=order)

encoder = ColumnTransformer(
    [("cat", ord_enc, catcols)],
    remainder="passthrough",
    verbose_feature_names_out=False
)

X_b = encoder.fit_transform(X)

print(X_b)
[[ 3.0000e+00 -1.2223e+02  3.7880e+01 ...  3.2200e+02  1.2600e+02
   8.3252e+00]
 [ 3.0000e+00 -1.2222e+02  3.7860e+01 ...  2.4010e+03  1.1380e+03
   8.3014e+00]
 [ 3.0000e+00 -1.2224e+02  3.7850e+01 ...  4.9600e+02  1.7700e+02
   7.2574e+00]
 ...
 [ 0.0000e+00 -1.2122e+02  3.9430e+01 ...  1.0070e+03  4.3300e+02
   1.7000e+00]
 [ 0.0000e+00 -1.2132e+02  3.9430e+01 ...  7.4100e+02  3.4900e+02
   1.8672e+00]
 [ 0.0000e+00 -1.2124e+02  3.9370e+01 ...  1.3870e+03  5.3000e+02
   2.3886e+00]]

🚩 Exercise 12 (EXTRA)¶

Another handy feature of OneHotEncoder and OrdinalEncoder is that infrequent categories can be aggregated into a single feature/value. The parameters to enable the gathering of infrequent categories are min_frequency and max_categories.

Use the max_categories parameter to set the maximum number of categories to 4. Use the get_feature_names_out() method of OneHotEncoder to print the new category names. Which two features have been combined?

In [ ]:
# Code for your answer here!
# No ColumnTransformer is needed here: we only use one when different transformations must be applied across multiple columns
enc = OneHotEncoder(max_categories=4, handle_unknown="ignore")
enc.fit(housing[["ocean_proximity"]])

print(enc.get_feature_names_out(["ocean_proximity"]))
print(enc.infrequent_categories_)
['ocean_proximity_<1H OCEAN' 'ocean_proximity_INLAND'
 'ocean_proximity_NEAR OCEAN' 'ocean_proximity_infrequent_sklearn']
[array(['ISLAND', 'NEAR BAY'], dtype=object)]

The two infrequent categories, ISLAND and NEAR BAY, have been combined into the single feature ocean_proximity_infrequent_sklearn.

🚩 Exercise 13 (EXTRA)¶

When there are many unordered categories, another useful encoding scheme is TargetEncoder which uses the target mean conditioned on the categorical feature for encoding unordered categories. Whereas one-hot encoding would greatly inflate the feature space if there are a very large number of categories (e.g. zip code or region), TargetEncoder is more parsimonious.

Use target encoding of ocean proximity. What are the numerical values assigned to the categories?

Caution: when using this transformer, be careful to avoid data leakage and overfitting by integrating it properly in your model pipeline! We will learn more about this later.

In [65]:
# Code for your answer here!

# Create a target encoder
from sklearn.preprocessing import TargetEncoder

tgc = TargetEncoder(target_type="continuous")
tgc.fit(X[["ocean_proximity"]],y)

cats = tgc.categories_[0]
vals = tgc.transform(pd.DataFrame({"ocean_proximity": cats})).ravel()
mapping = dict(zip(cats,vals))
mapping
Out[65]:
{'<1H OCEAN': 240081.20980262995,
 'INLAND': 124810.00113345706,
 'ISLAND': 367882.7345151113,
 'NEAR BAY': 259186.43556292081,
 'NEAR OCEAN': 249415.94571004034}

Numerical Variables¶

Feature Scaling¶

As we will discuss in later weeks, many machine learning algorithms are sensitive to the scale and magnitude of the features, and especially differences in scales across features. For these algorithms, feature scaling will improve performance.

For example, let's investigate the range of the features in our dataset:

In [66]:
fig, ax = plt.subplots(figsize=(15,5))

plt.boxplot(X[numcols], vert = False) 
plt.xscale("symlog") 
plt.ylabel("Feature") 
plt.xlabel("Feature Range")

ax.set_yticklabels(numcols)

plt.suptitle("Feature Range for the Training Set")
plt.tight_layout()
plt.show()
In [71]:
fig, axes = plt.subplots(figsize = (15,10), ncols = (housing.shape[1]-1)//3, nrows = 3, sharex = True)
axes = axes.flatten()
# ncols = (housing.shape[1] - 1) // 3: subtract 1 since the categorical ocean_proximity cannot be boxplotted, giving 3 columns of subplots across 3 rows
for i, ax in enumerate(axes):
    sns.boxplot(y = housing.iloc[:,i], ax = ax, whis=3) 
    ax.set_title(housing.iloc[:,i].name)
    ax.set_ylabel("")
    
plt.suptitle("Boxplots")
plt.tight_layout()
plt.show()

Differences from the earlier plot: the first figure shows only the numerical columns (numcols) together on a single shared symlog axis, while this one draws a separate boxplot for each feature.

There are various options in scikit learn for feature scaling, including:

  • Standardization (preprocessing.StandardScaler)

  • Min-Max Scaling (preprocessing.MinMaxScaler)

  • l2 Normalization (preprocessing.normalize)

  • RobustScaler(preprocessing.RobustScaler)

  • Scale with maximum absolute value (preprocessing.MaxAbsScaler)

  • As scaling generally improves the performance of most models when features cover a range of scales, it is probably a good idea to apply some sort of scaling to our data before fitting a model.
  • Standardization (or variance scaling) is the most common, but there are a number of other types, as listed above.
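To see why the choice of scaler can matter, here is a small sketch (on made-up data with one outlier) contrasting StandardScaler with RobustScaler, which uses the median and interquartile range and so is less affected by extreme values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy feature with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)  # (x - mean) / std
x_rob = RobustScaler().fit_transform(x)    # (x - median) / IQR

# The outlier inflates the mean and std, squashing the bulk of the data;
# the median and IQR ignore it, so the bulk keeps its spread
print(x_std.ravel().round(2))
print(x_rob.ravel().round(2))
```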

🚩 Exercise 14 (CORE)¶

Try implementing at least two different scalers for the total_rooms and total_bedrooms variables. Make a scatter plot of the original and transformed features to see the main differences.

In [ ]:
# Code for your answer here!
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

cols = ["total_rooms", "total_bedrooms"]
X_raw = X[cols].copy()

X_std = StandardScaler().fit_transform(X_raw)
X_mm = MinMaxScaler().fit_transform(X_raw)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(X_raw[cols[0]], X_raw[cols[1]], s=8, alpha=0.3)
axes[0].set_title("Original")
axes[0].set_xlabel(cols[0])
axes[0].set_ylabel(cols[1])

axes[1].scatter(X_std[:, 0], X_std[:, 1], s=8, alpha=0.3)
axes[1].set_title("StandardScaler")
axes[1].set_xlabel(cols[0])
axes[1].set_ylabel(cols[1])

axes[2].scatter(X_mm[:, 0], X_mm[:, 1], s=8, alpha=0.3)
axes[2].set_title("MinMaxScaler")
axes[2].set_xlabel(cols[0])
axes[2].set_ylabel(cols[1])

plt.tight_layout()
plt.show()

#StandardScaler: each column is rescaled to mean 0 and variance 1
#MinMaxScaler:   each column is rescaled to the range [0, 1]

Power Transformation¶

In some cases, we may wish to apply transformations to our data, so that they have a more Gaussian distribution. For example, log transformations are useful for altering count data to have a more normal distribution as they pull in the more extreme high values relative to the median, while stretching back extreme low values away from the median. You can use a log transformation with either the pre-made LogTransformer() from feature_engine.transformation, or a custom function and sklearn.preprocessing.FunctionTransformer.
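For instance, a minimal sketch of a log transformer built with FunctionTransformer (using np.log1p, i.e. log(1 + x), so that zeros are handled safely, and supplying the inverse so the transformation can be undone) might look like:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p maps 0 -> 0 and compresses large counts; expm1 is its inverse
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

counts = np.array([[0.0], [9.0], [99.0]])
transformed = log_tf.fit_transform(counts)
restored = log_tf.inverse_transform(transformed)  # recovers the originals
print(transformed.ravel())
```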

More generally, the natural logarithm, square root, and inverse transformations are special cases of the Box-Cox family of transformations (Box and Cox 1964). The question is why and when do we need such a transformation?

  • Note that, the method is typically used to transform the outcome, but can also be used to transform predictors.

  • The method assumes that the variable takes only positive values. If there are any zero or negative values, we can 1) shift the distribution towards positive values by adding a constant, or 2) use the Yeo-Johnson transformation (Yeo and Johnson 2000).

  • In general, transformations can make interpretations more difficult, thus you should think carefully about whether they are needed, particularly if they only result in modest improvements in model performance. Moreover, finding a suitable transformation is typically a trial-and-error process.

  • Moreover, if you are transforming the features, you should also consider how this alters the relationship with the target variable.

The Yeo-Johnson transformation is defined as:

$\tilde{y} = \left\{ \begin{array}{l l} \frac{(y+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \text{ and } y \geq 0 \\ \log (y +1), & \lambda = 0 \text{ and } y \geq 0 \\ -\frac{(1-y)^{2-\lambda} - 1}{2-\lambda}, & \lambda \neq 2 \text{ and } y < 0 \\ -\log (1-y), & \lambda = 2 \text{ and } y < 0\end{array} \right.,$

with the Box-Cox transformation recovered as a special case (for $y \geq 0$, the Yeo-Johnson transformation coincides with Box-Cox applied to $y+1$).

Because the parameter of interest is in the exponent, this type of transformation is called a power transformation and is implemented in sklearn's PowerTransformer. The parameter $\lambda$ is estimated from the data, and some values of $\lambda$ relate to common transformations, such as (for $y \geq 0$):

  • $\lambda = 1$ (no transformation)
  • $\lambda = 0$ (log)
  • $\lambda = 0.5$ (square root)
  • $\lambda = -1$ (inverse)
  • Using the code below, if lmbda=None then the function will "find the lambda that maximizes the log-likelihood function and return it as the second output argument"

  • Notice that the parameter cannot be called lambda directly, since lambda is a reserved keyword in Python; this is why SciPy names it lmbda instead.
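As a quick sanity check of the special cases listed above, we can compare scipy's boxcox at fixed lmbda values against the corresponding closed-form transformations (on made-up positive data):

```python
import numpy as np
from scipy import stats

y = np.array([1.0, 2.0, 5.0, 10.0])

# Box-Cox is (y**lmbda - 1) / lmbda, reducing to log for lmbda = 0
print(np.allclose(stats.boxcox(y, lmbda=0), np.log(y)))               # True
print(np.allclose(stats.boxcox(y, lmbda=1), y - 1))                   # True
print(np.allclose(stats.boxcox(y, lmbda=0.5), 2 * (np.sqrt(y) - 1)))  # True
```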

In [ ]:
fig, axes = plt.subplots(figsize = (15,5), ncols = 4, nrows=2, sharey = True)
axes = axes.flatten()
sns.histplot(data = X['households'], ax = axes[0])
axes[0].set_title("Raw Counts")

for i, lmbda in enumerate([0, 0.25, 0.5, 0.75, 1., 1.25, 1.5]):
    
    house_box_ = stats.boxcox(X['households'].astype(float), lmbda = lmbda)
    sns.histplot(data = house_box_, ax = axes[i + 1])
    axes[i + 1].set_title(r"$\lambda$ = {}".format(lmbda))
    
plt.tight_layout()
plt.show()

fig, axes = plt.subplots(figsize = (15,5), ncols = 4, nrows=2, sharey = True)
axes = axes.flatten()
sns.scatterplot(x = X['households'], y = y, ax = axes[0])
axes[0].set_title("Raw Counts")

for i, lmbda in enumerate([0, 0.25, 0.5, 0.75, 1., 1.25, 1.5]):
    
    house_box_ = stats.boxcox(X['households'].astype(float), lmbda = lmbda)
    sns.scatterplot(x = house_box_, y = y, ax = axes[i + 1])
    axes[i + 1].set_title(r"$\lambda$ = {}".format(lmbda))
   
plt.tight_layout()
plt.show()

We can find the $\lambda$ that maximizes the log-likelihood function using scipy's boxcox function or sklearn's PowerTransformer.

In [ ]:
# Find the MLE for lambda (using scipy's boxcox function)
house_box_, bc_params = stats.boxcox(X['households'].astype(float), lmbda = None)
print(round(bc_params, 2))

# Find the MLE for lambda (using sklearn's PowerTransformer)
from sklearn.preprocessing import PowerTransformer
power_transformer = PowerTransformer(method='box-cox', standardize=False)
X_boxcox = power_transformer.fit_transform(X[['households']])
print(round(power_transformer.lambdas_[0], 2))

🚩 Exercise 15 (EXTRA)¶

  • For the variable households, based on the boxcox transform shown above, do you think any of the values of $\lambda$ may be useful?

  • Apply a similar code snippet to median_house_value. Would any values of $\lambda$ be useful?

Type your answer here!

In [ ]:
# Code for your answer here!

Feature Combinations¶

  • Looking at the data's attributes, we may also want to manually combine them into features that are either a more meaningful representation of the data or have better properties.

  • For example, we know the number of rooms in a district, but this may be more useful to combine with the number of households so that we have a measure of rooms per household.

In [ ]:
rooms_per_household = X['total_rooms'] / X['households']
rooms_per_household.describe()

🚩 Exercise 16 (EXTRA)¶

  • Can you think of other combinations that may be useful?

  • Create a custom transformer that creates these new combinations of features using the FunctionTransformer.


Hint What about the following?
  • population_per_household

  • bedrooms_per_room

In [ ]:
# Code for your answer here!
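One possible sketch of an answer: a FunctionTransformer wrapping a custom function that appends the hinted ratio features. It assumes the housing column names, and is demonstrated on a tiny made-up frame (the same transformer could equally be applied to X):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_combinations(df):
    """Append ratio features; assumes the housing column names exist."""
    df = df.copy()
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    df["population_per_household"] = df["population"] / df["households"]
    df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
    return df

combine_tf = FunctionTransformer(add_combinations)

# Tiny made-up frame mirroring the first district's values
toy = pd.DataFrame({"total_rooms": [880.0], "total_bedrooms": [129.0],
                    "population": [322.0], "households": [126.0]})
out_df = combine_tf.fit_transform(toy)
print(out_df[["rooms_per_household", "bedrooms_per_room"]])
```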

Other feature types¶

Feature engineering for other feature types beyond numerical categorical are also available in sklearn (e.g. for text and images) and feature engine (e.g. for Datetime and for time series).

🏁 Now is a good point to switch driver and navigator

Combining into a Pipeline¶

Now that we are familiar with transformers, we are finally ready to create our first model pipeline!

Pipelines are very useful when we want to run data through the same steps in the future; rather than having to copy and paste a load of code, we can just use our pipeline, which combines all the steps. Later in the course, we will see this is important when we split our data into training, validation, and test sets, but it would also be required if you deploy your model in a "live" environment. In particular, pipelines help protect against data leakage, i.e. when information from your testing data leaks into your training or model selection. Data leakage is a common reason why many ML models fail to generalize to real-world data. Furthermore, when refining a model, pipelines make it easier for us to add or remove steps to see what works and what doesn't.

It's also worth examining what is meant by a "Pipeline". A general definition is that it is just a sequence of data preparation operations that is ensured to be reproducible. Specifically, in sklearn, a Pipeline can contain a sequence of transformer or estimator classes, or, if we use an imbalanced-learn Pipeline instead, also resamplers. This week we have focused on transformers, but later in the course we will learn about estimators and resamplers. All three of these objects (resamplers, transformers, and estimators) typically have a .fit() method. We have already seen examples of calling .fit() on transformers. The method works similarly on the other classes and is used to

  • validate and interpret any parameters,
  • validate the input data,
  • estimate and store attributes from the parameters and provided data,
  • return the fitted estimator to facilitate method chaining in a pipeline.

Along with other sample properties (e.g. sample_weight), the .fit() method usually takes two inputs:

  • The input matrix (or design matrix) $\mathbf{X}$. The size of $\mathbf{X}$ is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

  • The target values $\mathbf{y}$ which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, $\mathbf{y}$ does not need to be specified.

https://scikit-learn.org/stable/getting_started.html

Other methods available for these objects besides .fit() will depend on what they are, e.g. .transform() for transformers, so we will learn about the methods for other objects later in the course.
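To make the fit/transform contract concrete, here is a minimal, hypothetical custom transformer following the four steps listed above (the class name and behaviour are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical minimal transformer: subtracts the column means
    learned during fit, illustrating the fit/transform contract."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)  # validate the input data
        self.means_ = X.mean(axis=0)    # estimate and store attributes
        return self                     # return self to allow chaining

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_

ct = MeanCenterer().fit([[1.0, 10.0], [3.0, 30.0]])
print(ct.transform([[1.0, 10.0]]))  # subtracts the learned column means
```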

This week, our focus is combining different feature engineering steps together to make different model pipelines.

  • Remember we want to create a pipeline that treats the numerical and categorical attributes differently.
  • We also need to supply the pipeline with an estimator (i.e. model). For now, let's use a linear regression model, which we will learn in more details in week 4.
In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

numcols = features[:-1]
catcols = [features[-1]]

num_pre = Pipeline([
    ("num_impute", SimpleImputer(strategy="median")),
    ("num_scale", StandardScaler())])

cat_pre = Pipeline([
    ("cat_encode", OneHotEncoder(drop='first'))])

reg_pipe_1 = Pipeline([
    ("pre_processing", ColumnTransformer([("num_pre", num_pre, numcols),
                                          ("cat_pre", cat_pre, catcols)], 
                                          verbose_feature_names_out=False)),
    ("model", LinearRegression())
])

# Alternative and equivalent model avoiding nested pipelines
# reg_pipe_1 = Pipeline([
#     ("impute", ColumnTransformer([("num_imp", SimpleImputer(strategy="median"), numcols),
#                                           ("cat_imp", SimpleImputer(strategy="constant"), catcols)])),
#     ("transform", ColumnTransformer([("num_trns", StandardScaler(), numcols),
#                                           ("cat_trns", OneHotEncoder(drop='first'), catcols)])),                                     
#     ("model", LinearRegression())
# ])

display(reg_pipe_1)
In [ ]:
reg_pipe_1.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(reg_pipe_1.score(X, y), 3))
In [ ]:
# Print the coefficients
coef_df = pd.DataFrame({'coef': reg_pipe_1['model'].coef_}, 
             index = reg_pipe_1['pre_processing'].get_feature_names_out())
display(coef_df)

Let's try some other combinations of the pre-processing and feature engineering steps that we have learned about this week.

In [ ]:
# Reg Pipe 2 

# Define column indices
numcols = ['longitude', 'latitude', 'housing_median_age', 'median_income']
countcols = ['total_rooms', 'total_bedrooms', 'population', 'households']

# Reg Pipe 2 
num_pre = Pipeline([
    ("num_scale", StandardScaler())])

count_pre = Pipeline([
    ("count_impute", SimpleImputer(strategy="median")),
    ("count_transform", PowerTransformer(method='box-cox', standardize=True))])

cat_pre = Pipeline([
    ("cat_encode", OneHotEncoder(drop='first'))])

# Overall ML pipeline including all steps
reg_pipe_2 = Pipeline([
    ("pre_processing", ColumnTransformer([
        ("num_pre", num_pre, numcols), 
        ("count_pre", count_pre, countcols),
        ("cat_pre", cat_pre, catcols)], verbose_feature_names_out=False)), 
    ("model", LinearRegression())
])


display(reg_pipe_2)
In [ ]:
reg_pipe_2.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(reg_pipe_2.score(X, y), 3))
In [ ]:
# Print the coefficients
coef_df = pd.DataFrame({'coef': reg_pipe_2['model'].coef_}, 
             index = reg_pipe_2['pre_processing'].get_feature_names_out())
display(coef_df)
In [ ]:
# Reg Pipe 3
from feature_engine.transformation import LogTransformer
from sklearn.compose import TransformedTargetRegressor

numcols = ['longitude', 'latitude', 'housing_median_age']
skewcols = ['total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

num_pre = Pipeline([
    ("num_scale", StandardScaler())])

skew_pre = Pipeline([
    ("skew_impute", SimpleImputer(strategy="median")),
    ("skew_transform", LogTransformer()),
    ("skew_scale", StandardScaler())])

cat_pre = Pipeline([
    ("cat_encode", OneHotEncoder(drop='first'))])

# Overall ML pipeline including all steps
reg_pipe_3 = Pipeline([
    ("pre_processing", ColumnTransformer([
        ("num_pre", num_pre, numcols), 
        ("skew_pre", skew_pre, skewcols), 
        ("cat_pre", cat_pre, catcols)], verbose_feature_names_out=False)), 
    ("model", LinearRegression())
])
# Transform also the target variable
tt_reg_pipe_3 =TransformedTargetRegressor(regressor=reg_pipe_3,
                                  transformer=LogTransformer())

display(tt_reg_pipe_3)
In [ ]:
tt_reg_pipe_3.fit(X,y)
# Print the R squared (ranges 0 to 1, with higher values better)
print(round(tt_reg_pipe_3.score(X, y), 3))
In [ ]:
# Print the coefficients
# Note: get_feature_names_out() does not work for LogTransformer
reg3_features = np.concatenate([tt_reg_pipe_3.regressor_['pre_processing']['num_pre'].get_feature_names_out(),
               tt_reg_pipe_3.regressor_['pre_processing']['skew_pre'].feature_names_in_,
               tt_reg_pipe_3.regressor_['pre_processing']['cat_pre'].get_feature_names_out()]
)

coef_df = pd.DataFrame({'coef': tt_reg_pipe_3.regressor_['model'].coef_}, 
             index = reg3_features) 
display(coef_df)

🚩 Exercise 17 (CORE)¶

Explain in words what are the differences in pre-processing and/or feature engineering steps used across the three model pipelines above.

Type your answer here!

🚩 Exercise 18 (EXTRA)¶

Try to create your own pipeline by modifying at least one of the pre-processing and feature engineering steps above. What have you decided to change and why?

Summary ¶

This week we covered a lot of ground!

  • We've looked at some methods for pre-processing our data, cleaning and preparing it, as well as how to engineer some features and combine these steps into a reproducible pipeline.

  • This is by no means a complete collection of all the methods available as covering more would go beyond the scope of this course (for those interested in learning more, have a look though the given companion readings).

  • For example, we did not touch much on handling text and dates/times. These topics are quite complex, with enough material to fill courses of their own.

Completing the Worksheet¶

At this point you have hopefully been able to complete all the CORE exercises and attempted the EXTRA ones. Now is a good time to check the reproducibility of this document by restarting the notebook's kernel and rerunning all cells in order.

Before generating the PDF, please change 'Student 1' and 'Student 2' at the top of the notebook to include your name(s).

Once that is done and you are happy with everything, you can then run the following cell to generate your PDF.

In [ ]:
!jupyter nbconvert --to pdf mlp_week01.ipynb 

Once generated, please submit this PDF on the Learn page by 16:00 on the Friday of the week the workshop was given. Note that:

  • You don't need to finish everything, but you should have had a substantial attempt at the bulk of the material, particularly the CORE tasks.
  • If you are having trouble generating the pdf, please ask a tutor or post on Piazza.
  • As a backup option, if you are having errors in converting to pdf, then a quick solution is to export to html and then convert to pdf in your browser.